Red Wine Quality Exploration by Yuan Zhou

knitr::opts_chunk$set(fig.width=12, fig.height=8, width = 200,
                      echo=FALSE, warning=FALSE, message=FALSE)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This report explores the Red Wine Quality data set, which contains the quality rating and attributes of 1599 red wines.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

From the summary, we can see that fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates likely have outliers, since the gap between the maximum and the 3rd quartile is very large for each of them.
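One common way to make "very big" concrete is Tukey's rule, which flags any value above Q3 + 1.5 × IQR. The sketch below applies that rule to the quartiles and maxima printed in the summary above. The data frame name `wine_summary` and the choice of the 1.5 × IQR rule are assumptions made for illustration, not part of the original analysis.

```r
# Quartiles and maxima copied from the summary() output above.
wine_summary <- data.frame(
  attribute = c("fixed.acidity", "volatile.acidity", "citric.acid",
                "residual.sugar", "chlorides", "free.sulfur.dioxide",
                "total.sulfur.dioxide", "sulphates"),
  q1  = c(7.10, 0.39, 0.09, 1.90, 0.07, 7, 22, 0.55),
  q3  = c(9.20, 0.64, 0.42, 2.60, 0.09, 21, 62, 0.73),
  max = c(15.90, 1.58, 1.00, 15.50, 0.611, 72, 289, 2.00)
)

# Tukey's upper fence (an assumed rule): values above Q3 + 1.5 * IQR
# are treated as outliers.
wine_summary$iqr            <- wine_summary$q3 - wine_summary$q1
wine_summary$upper_fence    <- wine_summary$q3 + 1.5 * wine_summary$iqr
wine_summary$max_is_outlier <- wine_summary$max > wine_summary$upper_fence

print(wine_summary)
```

Under this rule, the maximum of each of the eight listed attributes lies above its upper fence.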

Univariate Plots Section

Most red wines are in the middle range of quality.

From the above two histograms, we can see that fixed.acidity and volatile.acidity are almost normally distributed. Meanwhile, according to the dataset introduction, a high level of volatile acidity leads to an unpleasant, vinegar taste, so I wonder whether it is an important factor affecting the judgement of quality.

The histogram of citric.acid on the first row looks skewed; after a log transform, the peak is more obvious. One possible explanation is that citric acid is found in small quantities, so the bin width is not small enough to separate different values. Also, as described in the dataset introduction, citric acid can add ‘freshness’ and flavor to wine, which may also be an important factor for the quality rating.

The distribution of residual.sugar is close to normal, but has some very distinct outliers.

The distribution of chlorides is similar to that of residual.sugar. We can see that most red wines have low levels of residual sugar and chlorides.

The distributions of free.sulfur.dioxide and total.sulfur.dioxide are both skewed with distinct outliers, and both are close to normal after a log transform.
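To illustrate why a log transform can bring a right-skewed variable close to normal, here is a minimal sketch on simulated data. The values are drawn from a log-normal distribution and are not the wine data. The names `so2_sim` and `skewness` and the simulation parameters are hypothetical.

```r
# Simulated right-skewed values standing in for a variable such as
# total.sulfur.dioxide; these are NOT the wine data.
set.seed(42)
so2_sim <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

# Sample skewness: the third standardized moment (hypothetical helper).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(so2_sim)
skew_log <- skewness(log10(so2_sim))
cat("skewness before log10:", round(skew_raw, 2), "\n")
cat("skewness after log10: ", round(skew_log, 2), "\n")

# To compare the shapes visually:
# hist(so2_sim, breaks = 30)
# hist(log10(so2_sim), breaks = 30)
```

A skewness close to zero after the transform is consistent with a roughly symmetric, bell-shaped histogram.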

The density histogram is also close to a normal distribution, within a narrow range near the density of water.

pH values are all well below 7 (the maximum is 4.01) and are close to normally distributed.

Sulphates are almost normally distributed, with a few outliers.

The histogram of alcohol looks skewed, and a log or sqrt transform does not improve it much.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines in the dataset. Each has 12 attributes, including the output attribute (quality), plus one sequential row number (X).

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

The main feature is quality. I would like to explore which attributes in this dataset affect the quality rating of red wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The background knowledge given by the dataset info makes me want to pay more attention to volatile.acidity and citric.acid. Meanwhile, the univariate plots show that several attributes have a few very distinct outliers, which may relate to the fact that only a small number of red wines have very low or very high quality.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Several distributions are skewed and look better after a log transform, with the exception of alcohol. Some distributions also have very distinct outliers.

Bivariate Plots Section

From the above correlation matrix, we can notice that:

quality

  1. quality positively correlates with alcohol
  2. quality negatively correlates with volatile.acidity

acidity

  1. fixed.acidity positively correlates with citric.acid
  2. fixed.acidity negatively correlates with pH
  3. volatile.acidity negatively correlates with citric.acid
  4. citric.acid negatively correlates with pH

density

  1. fixed.acidity positively correlates with density
  2. alcohol negatively correlates with density

The attributes mentioned above may be related to the quality rating, while free.sulfur.dioxide and total.sulfur.dioxide can be excluded.
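For readers who want to see how such a correlation matrix is computed, here is a minimal sketch using base R's `cor()`. It uses only the first ten rows of selected columns, copied from the `str()` output at the top of this report, so the coefficients it prints are not the ones discussed above. The data frame name `wine_head` is hypothetical.

```r
# First ten rows of selected columns, copied from the str() output above.
# Ten rows are far too few to reproduce the full-dataset correlations.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix, rounded for readability.
cor_matrix <- round(cor(wine_head, method = "pearson"), 2)
print(cor_matrix)
```

On the full data set, the same call on all numeric columns would give the matrix that a correlation plot visualizes.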

Relationship between quality and other attributes

To get a better understanding of those variables’ correlation with quality, we can create a boxplot and a frequency polygon for each level of quality. Since quality is an integer with a limited number of values, we can convert it to a factor for plotting.
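The conversion to a factor can be sketched as follows. The example uses only the first ten quality and alcohol values from the `str()` output at the top of this report, so it shows the mechanics rather than the real distributions. The variable names are hypothetical.

```r
# First ten quality scores and alcohol values, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Treat quality as a categorical variable so each score gets its own group.
quality_factor <- factor(quality)
print(levels(quality_factor))

# Median alcohol per quality level (the centre line of each box).
median_alcohol <- tapply(alcohol, quality_factor, median)
print(median_alcohol)

# To draw the boxplot itself:
# boxplot(alcohol ~ quality_factor, xlab = "quality", ylab = "alcohol")
```

With a factor on the grouping axis, plotting functions draw one box or one polygon per quality level instead of treating quality as a continuous scale.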

quality vs alcohol

It seems that higher alcohol content goes with better quality only when quality >= 6; this positive correlation does not hold for the lower-quality wines. Meanwhile, quality 5 wines cover a larger range of alcohol content, with a lot of outliers.

quality vs volatile.acidity

From the two figures above, we can see that quality generally increases as volatile.acidity decreases, which is in accordance with their negative correlation. It also reflects the dataset info: “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. However, there are a few exceptions that may not fit this conclusion:

  1. The outliers among the high quality (7, 8) red wines have the same level of volatile.acidity as the low quality red wines (3, 4).

  2. Some quality 7 wines have an even lower level of volatile.acidity than quality 8 wines.

Relationships among the other attributes

Next, we explore the relationships among the attributes other than quality.

fixed.acidity vs citric.acid

fixed.acidity vs pH

volatile.acidity vs citric acid

citric.acid vs pH

The above four plots present the relationships among the acidity-related attributes. pH has a strong negative correlation with both fixed.acidity and citric.acid, which makes sense because more acid means a lower pH, so we can ignore pH in further analysis.

fixed.acidity vs density

density vs alcohol

From the above two plots we can see that density is highly correlated with fixed.acidity. Meanwhile, the fact that alcohol is less dense than water explains the negative correlation between density and alcohol.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. alcohol and volatile.acidity are the most important factors for the rating of quality.
  2. fixed.acidity and citric.acid are less important factors for the rating of quality, but they still have some influence.
  3. Attributes related to SO2, like total.sulfur.dioxide, free.sulfur.dioxide
    and sulphates, are not so important for the rating of quality.
  4. After exploring the relationships among the attributes related to acidity and density separately, we can regard pH and density as results of other attributes rather than causes, so we can ignore these two in further analysis.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

At first I found it curious that volatile.acidity has a negative correlation with fixed.acidity and a positive correlation with pH, but after reading more of the dataset info, it makes sense.

What was the strongest relationship you found?

Positive relationships: quality vs alcohol, and fixed.acidity vs citric.acid.

Negative relationships: quality vs volatile.acidity, and volatile.acidity vs citric.acid.

Multivariate Plots Section

From this plot, it is hard to find a relationship with quality. Most points have quality 5 or 6, so let’s look back at the dataset: levels 3, 4 and 8 have insufficient data for analysis. I would therefore like to remove them and focus only on the levels with larger numbers of samples.

##   3   4   5   6   7   8 
##  10  53 681 638 199  18
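The subsetting step can be sketched as follows. Because the raw file is not reproduced in this report, the example rebuilds only the quality column from the counts in the table above. The name `wine_sample` matches the data frame used in the model calls later in this report, but the way it is built here is an assumption, and `wine_quality` is a hypothetical stand-in for the full data frame.

```r
# Rebuild the quality column from the counts in the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
wine_quality <- data.frame(
  quality = rep(as.integer(names(quality_counts)), times = quality_counts)
)

# Keep only the levels with enough samples (5, 6 and 7).
wine_sample <- subset(wine_quality, quality %in% 5:7)

print(nrow(wine_quality))
print(table(wine_sample$quality))
```

The subset keeps 681 + 638 + 199 = 1518 wines, which matches the N reported for the linear models.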

From the plots above, wines of different quality are widely scattered. Scale transforms such as sqrt or log10 also do not improve things much, which makes it difficult to identify a linear relationship.

Linear Modelling

Since it is not easy to find an appropriate scale for these attributes, I will build the linear model using the untransformed values of the most important attributes found in the previous analysis: alcohol, volatile.acidity, citric.acid and fixed.acidity.

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = wine_sample)
## m2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity, 
##     data = wine_sample)
## m3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     citric.acid, data = wine_sample)
## m4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity + 
##     citric.acid + fixed.acidity, data = wine_sample)
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)           0.229         1.108***      1.067***      0.651***  
##                        (0.152)       (0.166)       (0.175)       (0.195)    
##   alcohol               0.332***      0.297***      0.297***      0.306***  
##                        (0.015)       (0.014)       (0.014)       (0.014)    
##   volatile.acidity                   -0.987***     -0.945***     -1.027***  
##                                      (0.088)       (0.105)       (0.106)    
##   citric.acid                                       0.068        -0.313*    
##                                                    (0.092)       (0.121)    
##   fixed.acidity                                                   0.055***  
##                                                                  (0.012)    
## ----------------------------------------------------------------------------
##   R-squared             0.255         0.311         0.312         0.322     
##   adj. R-squared        0.254         0.310         0.310         0.320     
##   sigma                 0.598         0.575         0.575         0.571     
##   F                   518.376       342.540       228.474       179.355     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1371.875     -1311.941     -1311.667     -1300.544     
##   Deviance            541.720       500.589       500.408       493.128     
##   AIC                2749.751      2631.882      2633.334      2613.087     
##   BIC                2765.726      2653.183      2659.960      2645.038     
##   N                  1518          1518          1518          1518         
## ============================================================================

This model can only predict the middle quality levels well, since they have more samples; its predictions for the other levels are biased.
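The pattern of building nested models and comparing their fit statistics can be sketched as below. The data here are simulated and are not the wine data; the coefficients used to generate the response are arbitrary, and the names `sim` and `fit_stats` are hypothetical. The printed numbers will therefore not match the table above.

```r
# Simulated predictors and response; these are NOT the wine data.
set.seed(1)
n <- 1518
sim <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  citric.acid      = runif(n, 0, 1)
)
# Response generated from an arbitrary linear rule plus noise.
sim$quality <- 1 + 0.3 * sim$alcohol - 1.0 * sim$volatile.acidity +
  rnorm(n, sd = 0.6)

# Nested models, each adding one predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + citric.acid)

# Collect fit statistics side by side.
models <- list(m1, m2, m3)
fit_stats <- data.frame(
  model         = c("m1", "m2", "m3"),
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  deviance      = sapply(models, deviance),
  AIC           = sapply(models, AIC)
)
print(fit_stats)
```

For nested least-squares models, R-squared never decreases and deviance never increases when a predictor is added, so adjusted R-squared, AIC or BIC are more informative when deciding whether an extra predictor is worth keeping. In the table above, for example, AIC rises slightly from m2 to m3, where citric.acid is added without a significant coefficient.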

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The positive relationship between fixed.acidity and citric.acid and the negative relationship between volatile.acidity and citric.acid do not strengthen each other with respect to quality: as seen in the plots, the samples are widely scattered and it is hard to find boundaries that separate the different quality levels.

Were there any interesting or surprising interactions between features?

I am surprised that there are still so many outliers even after subsetting the data to only the mid-range quality samples.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear model on a subset of the dataset that includes only the mid-range quality wines (5, 6, 7), since the other levels have very few samples. As more features are added, the R-squared increases or stays the same (0.255 for m1, 0.322 for m4) and the deviance decreases (541.720 to 493.128), which shows a modest improvement in fit. Judging from the error results, this linear model can hardly predict poor or good quality. It may be better to split the data differently, so that 80 percent of the samples at each quality level go to training and the remaining 20 percent to testing.
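The 80/20 split per quality level suggested above could look like the following sketch in base R. It was not part of the original analysis. The quality labels are rebuilt from the counts reported earlier in this report, and the seed and variable names are hypothetical.

```r
# Quality labels rebuilt from the counts reported earlier in this report.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

set.seed(123)  # arbitrary seed, for reproducibility only
row_ids <- seq_along(quality)

# Within each quality level, sample 80% of the row indices for training.
train_ids <- unlist(lapply(split(row_ids, quality), function(ids) {
  ids[sample.int(length(ids), size = floor(0.8 * length(ids)))]
}))
test_ids <- setdiff(row_ids, train_ids)

# Share of each quality level that ended up in the training set.
train_share <- table(quality[train_ids]) / table(quality)
print(round(train_share, 2))
```

Sampling within each level keeps the rare quality scores (3, 4 and 8) represented in both the training and the testing set, which a simple random split does not guarantee.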


Final Plots and Summary

Plot One

Description One

This plot visualizes the correlation matrix and helps us find the attributes most correlated with quality, the feature we are interested in. It is also a convenient and direct way to find other strongly related attributes.

Plot Two

Description Two

This plot demonstrates that volatile.acidity is negatively and roughly linearly correlated with quality, which is in accordance with the dataset info: “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. The quality in this red wine dataset is the median of at least 3 evaluations made by wine experts, so this plot suggests that taste plays an important role in the experts’ judgement, which points me toward taste-related attributes for later analysis.

Plot Three

Description Three

After the univariate, bivariate and multivariate analyses, I created a linear model of the most important attributes, and this plot shows the performance of the model. The upper panel presents the training error and the lower one shows the results for the testing data.


Reflection

To be honest, I was not familiar with red wine and its properties (partly because of my poor English), so it took me a long time to understand the dataset at the beginning.

During the exploration and analysis, I found that the data itself imposes a few limitations:

  1. Samples of good or poor quality red wines are so rare that they behave like outliers, which makes it hard to draw fair conclusions.
  2. From my understanding, the quality is the median of at least 3 experts’ evaluations. Are these samples rated by the same group of experts? Are the differences between experts large?

The analysis procedure itself went well. By plotting the distribution of each variable, I got a basic understanding of the dataset. The correlation matrix then helped to find the two most important factors for quality, and after deeper bivariate analysis, I chose four attributes to build the final linear model. I believe this workflow is clear and insightful, and can be applied in future work to almost any dataset.